From Surveys to Populations
GVPT399F: Power, Politics, and Data
Surveys
Populations are very difficult to collect data on
- Even the census misses people!
Happily, we can survey a sample of our population to learn things about the population as a whole
However, our ability to do this is conditional on how good our sample is
What do I mean by “good”?
The 2024 US Presidential Election
- Elections are preceded by a flood of surveys
Parallel worlds
Think back to last session on experiments
In an ideal world, we would be able to create two parallel worlds (one with the treatment, one held as our control)
These worlds are perfectly identical to each other prior to treatment
We cannot do this :(
The next best thing
Our next best option is to create two groups that are as identical to one another as possible prior to treatment
If they are (almost) identical, differences between their group-wide outcomes can be attributed to the treatment
One good way of getting two (almost) identical groups is to assign individuals to those groups randomly
- Think back to our 1,000 hypothetical people!
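To see why random assignment tends to produce (almost) identical groups, here is a minimal sketch in base R. It assumes a made-up group of 1,000 people with a single simulated trait (`age`); the data frame `people`, the seed, and the age range are all hypothetical and chosen only for illustration.

```r
set.seed(399)  # arbitrary seed so the sketch is reproducible

# Hypothetical group of 1,000 people with one pre-treatment trait
n <- 1000
people <- data.frame(
  id  = seq_len(n),
  age = round(runif(n, min = 18, max = 90))
)

# Shuffle 500 "treatment" and 500 "control" labels across the 1,000 people
people$group <- sample(rep(c("treatment", "control"), each = n / 2))

# Compare the two groups on the pre-treatment trait
group_means <- tapply(people$age, people$group, mean)
group_means
```

The two group averages should land close to one another, even though nothing other than chance decided who went where.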
Randomization
Randomization continues to pop its chaotic head up
We can use it to create a sample that is (almost) identical to our population, on average
Drawing randomly from our population increases our chances of ending up with a sample that reflects that population
This is called a representative sample
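A rough way to check the "on average" claim is to draw many random samples from a known population and compare their averages to the truth. The sketch below uses base R and a simulated, skewed population of 200 units (an assumption for illustration, not real GDP data); `population`, `true_mean`, and `sample_means` are hypothetical names.

```r
set.seed(2022)  # arbitrary seed so the sketch is reproducible

# Hypothetical, skewed population of 200 units (simulated values)
population <- rlnorm(200, meanlog = 10, sdlog = 1)
true_mean <- mean(population)

# Draw 5,000 simple random samples of 60 units and record each sample's mean
sample_means <- replicate(5000, mean(sample(population, size = 60)))

# Any one sample can miss, but the estimates centre on the true mean
c(true_mean = true_mean, avg_of_sample_means = mean(sample_means))
```

Individual samples will sit above or below the population average, but across many draws the sample averages cluster around it.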
To illustrate
Countries’ GDP in 2022:
Countries’ GDP
I want to estimate the average GDP across all countries in 2022.
# Keep rows with a reported value, then draw 60 of them at random
sample_df <- gdp_df |>
  drop_na(sample_value) |>
  sample_n(size = 60) |>
  transmute(country, gdp = sample_value)

sample_df
# A tibble: 60 × 2
   country             gdp
   <chr>             <dbl>
 1 Chad            1.24e10
 2 Angola          1.04e11
 3 Kyrgyz Republic 1.21e10
 4 Sweden          5.80e11
 5 Turkiye         9.07e11
 6 Estonia         3.84e10
 7 Indonesia       1.32e12
 8 Eswatini        4.70e 9
 9 Uzbekistan      9.01e10
10 Montenegro      6.23e 9
# ℹ 50 more rows
Countries’ GDP
I now calculate the average of these responses, which I find to be:
sample_df |>
  summarise(avg_gdp = scales::dollar(mean(gdp, na.rm = TRUE)))
# A tibble: 1 × 1
  avg_gdp
  <chr>
1 $447,549,763,396
Now, imagine that we knew definitively that the true average across all countries was NA. Why such a large difference?
Non-response bias
Poorer countries are far less likely to be able or willing to provide these economic data to academics or international organizations.
- They tend to be underrepresented in a lot of data
My sample was biased against poorer countries.
- They were less likely than rich countries to respond to my request for data
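The mechanism can be sketched with a small simulation in base R. Everything here is assumed for illustration: the 200 "countries" are simulated rather than real GDP figures, and the response rule (the richer the country, the more likely it is to report) is a deliberately simple stand-in for whatever drives non-response in practice.

```r
set.seed(62)  # arbitrary seed so the sketch is reproducible

# Hypothetical population of 200 "countries" with simulated GDP values
gdp <- rlnorm(200, meanlog = 24, sdlog = 2)

# Assumed response model: the probability of reporting rises with GDP rank,
# so the poorest country almost never responds and the richest always does
p_respond <- rank(gdp) / length(gdp)
responded <- rbinom(length(gdp), size = 1, prob = p_respond) == 1

# Compare the true average with the average among responders only
true_mean <- mean(gdp)
respondent_mean <- mean(gdp[responded])

c(true_mean = true_mean, respondent_mean = respondent_mean)
```

Because poorer units drop out more often, the average among responders overstates the population average, no matter how randomly we then sample from those responders.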